Machine Learning Guides Peptide Nucleic Acid Flow Synthesis and Sequence Design

Abstract: Peptide nucleic acids (PNAs) are potential antisense therapies for genetic, acquired, and viral diseases. Efficiently selecting candidate PNA sequences for synthesis and evaluation from a genome containing hundreds to thousands of options can be challenging. To facilitate this process, this work leverages machine learning (ML) algorithms and automated synthesis technology to predict PNA synthesis efficiency and guide rational PNA sequence design. The training data is collected from individual fluorenylmethyloxycarbonyl (Fmoc) deprotection reactions performed on a fully automated PNA synthesizer. The optimized ML model achieves 93% prediction accuracy and a Pearson's r of 0.97. The predicted synthesis scores are validated to correlate with the experimental high-performance liquid chromatography (HPLC) crude purities (correlation coefficient R² = 0.95). Furthermore, the general applicability of ML is demonstrated by designing synthetically accessible antisense PNA sequences from 102 315 predicted candidates targeting exon 44 of the human dystrophin gene, SARS-CoV-2, HIV, as well as selected genes associated with cardiovascular diseases, type II diabetes, and various cancers. Collectively, ML provides an accurate prediction of PNA synthesis quality and serves as a useful computational tool for informing PNA sequence design.


This PDF file includes:
Materials and Methods; Figures S1 to S5; Tables S1 to S2; Synthetic UV-vis traces; HPLC and LC-MS traces
Purification and analytical reagents: Water for HPLC was purified to 18.2 MΩ·cm resistivity on a Millipore Milli-Q system. HPLC-grade acetonitrile was purchased from VWR International (Philadelphia, PA). LC-MS grade acetonitrile was purchased from Sigma-Aldrich (St. Louis, MO). Unless specified otherwise, all other reagents and solvents were purchased from Sigma-Aldrich, kept over activated 3 Å molecular sieves, and used directly without further purification.

Figure S1. Structure of the H-Rink Amide resin and Fmoc-protected PNA monomers.

Analytical high-performance liquid chromatography (HPLC) analysis
All crude PNA samples were dissolved in 30% acetonitrile in water (with 0.1% TFA additive). The solutions were filtered and then diluted to a concentration of approximately 0.1 mg/mL. The sample purity analysis was carried out on an Agilent 1200 series HPLC system.

Liquid chromatography-mass spectrometry (LC-MS) analysis
The main peaks from the HPLC were loaded onto an Agilent 1290 Infinity HPLC, and the mass was analyzed by an Agilent 6550 Q-TOF with Dual Jet Stream ESI ionization and iFunnel.

Introduction to the automated PNA synthesizer 'Tiny Tides'
All the PNA sequences studied in this work were prepared on an automated PNA synthesizer, 'Tiny Tides', which was designed previously in our lab. 1,2 This automated instrument contains seven major modules: a central control computer, three HPLC pumps, a reaction zone, a UV-visible detector, a solution storage system, heating elements, and three multi-position valves. All the modules were controlled by a modular script in the Mechwolf programming environment. An overview of Tiny Tides is presented in Figure S2, with every individual part labeled by name. Further design details for this automated synthesizer can be found in our recently published work. 1,2

Figure S2. Overview of the automated synthesizer with the major components labeled in position.

Automated Flow PNA Synthesis and UV−Vis Data Collection
All PNA sequences were synthesized on a fully automated flow synthesizer, which was built in the Pentelute lab and described previously. 1,2 The automated setup records the efficiency of every deprotection reaction in real time through an in-line UV-vis monitor. Optimized synthesis conditions, as detailed in our previous publication, 2 were used to synthesize all the PNA sequences.

The following stock solutions were used for PNA synthesis: the Fmoc- and benzhydryloxycarbonyl (Bhoc)-protected PNA monomers Fmoc-A(Bhoc)-aeg-OH, Fmoc-G(Bhoc)-aeg-OH, Fmoc-C(Bhoc)-aeg-OH, and Fmoc-T-aeg-OH, each as a 0.2 M stock solution in DMF; the activating agent O-(benzotriazol-1-yl)-N,N,N',N'-tetramethyluronium hexafluorophosphate (HBTU) as a 0.19 M stock solution in DMF; DIEA (10% v/v); and the deprotection stock solution (20% piperidine, 2% formic acid, 78% DMF, v/v/v). DMF was pretreated with AldraAmine trapping agents for >24 h before synthesis. Ten milligrams of H-Rink amide resin (0.49 mmol/g loading) were used in all experiments in the data set.

A standard synthesis involves (a) prewashing of the resin and (b) iterative coupling, washing, deprotection, and washing steps for each PNA monomer building block. No capping or multiple couplings were needed, and each coupling cycle took 3 minutes. The workflow, timeline, and reagents for a complete coupling cycle are shown in Figure S3. Steps 1-6 were repeated until all residues had been added. Deprotection was performed with one part 20% piperidine, 2% formic acid (v/v) in DMF and one part DMF for 50 seconds in the room-temperature loop. The in-line UV-vis signal is recorded after the reactor and before waste collection. UV synthesis data at a wavelength of 310 nm were collected for 239 individual deprotection steps from PNA synthesis experiments. The crude samples were cleaved off the resin and characterized with HPLC and LC-MS.

Figure S3. Workflow of a complete coupling cycle, performed during automated flow PNA synthesis.
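The resin quantities and cycle time quoted above fix the theoretical scale and the time spent in coupling cycles. The following is a minimal sketch of that arithmetic; the function names are illustrative, and it assumes that every building block (including any non-PNA residues) takes one 3-minute cycle and that the resin prewash is not counted.

```python
# Values taken from the synthesis description above.
RESIN_MASS_MG = 10.0        # H-Rink amide resin per experiment
LOADING_MMOL_PER_G = 0.49   # resin loading
CYCLE_MINUTES = 3.0         # duration of one coupling cycle


def synthesis_scale_umol(resin_mg=RESIN_MASS_MG, loading=LOADING_MMOL_PER_G):
    """Return the theoretical scale in micromoles (mg * mmol/g = umol)."""
    return resin_mg * loading


def cycling_time_minutes(n_building_blocks, cycle_minutes=CYCLE_MINUTES):
    """Return the time spent in coupling cycles, excluding the resin prewash.

    Assumption: each building block takes exactly one cycle.
    """
    return n_building_blocks * cycle_minutes


if __name__ == "__main__":
    print(f"theoretical scale: {synthesis_scale_umol():.2f} umol")
    for n in (6, 10, 14, 18):
        print(f"{n} building blocks: {cycling_time_minutes(n):.0f} min")
```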

PNA cleavage
Synthesized PNAs were cleaved off the solid support using trifluoroacetic acid (TFA) for purity characterization. In brief, cleavage from the resin and deprotection of all side-chain groups were carried out simultaneously with a cleavage cocktail containing 2.5% (v/v) 1,2-ethanedithiol (EDT), 2.5% (v/v) water, and 1% (v/v) triisopropylsilane in neat TFA for 2.0 h at room temperature. Five mL of the cleavage cocktail was used for approximately 0.1 mmol of compound. The cleaved crude PNAs were washed with dry-ice-cold ether and precipitated by centrifugation at 4,000 rpm for 3 min. The resultant solids were then dissolved in water/acetonitrile (50:50, v/v) and dried by lyophilization.
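As a rough illustration of the cocktail composition, the component volumes can be computed from the stated percentages. This sketch assumes that the percentages refer to the final cocktail volume with TFA making up the balance, and that the cocktail volume scales linearly with the amount of compound; neither assumption is stated in the protocol, and the function names are hypothetical.

```python
# Volume fractions from the cleavage protocol; TFA is ASSUMED to be the balance.
COCKTAIL_FRACTIONS = {"EDT": 0.025, "water": 0.025, "triisopropylsilane": 0.01}
ML_PER_MMOL = 5.0 / 0.1  # 5 mL of cocktail per ~0.1 mmol of compound


def cocktail_volumes_ml(total_ml):
    """Split a total cocktail volume into per-component volumes in mL."""
    volumes = {name: frac * total_ml for name, frac in COCKTAIL_FRACTIONS.items()}
    volumes["TFA"] = total_ml - sum(volumes.values())
    return volumes


def cocktail_for_scale_ml(scale_mmol):
    """Cocktail volume for a given scale, ASSUMING linear scaling."""
    return ML_PER_MMOL * scale_mmol


if __name__ == "__main__":
    for name, vol in cocktail_volumes_ml(5.0).items():
        print(f"{name}: {vol:.3f} mL")
```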

ML Prediction and Experimental Validation
Six PNA sequences with lengths ranging from 6- to 18-mer were randomly generated, and their synthesis outcomes were predicted using the optimized ML model. To validate the model predictions, all six PNA sequences were synthesized experimentally on our automated flow PNA synthesizer. The in-line UV-vis records were compared with the ML-predicted traces. Furthermore, the synthesized PNAs were cleaved off the resin and their crude purities were measured using HPLC. The PNA crude purities were compared with the ML-predicted scores, and the correlation strength was calculated.
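The correlation strength between predicted scores and measured purities can be quantified with Pearson's r; for a simple linear fit, R² equals r squared. The following is a minimal sketch of that calculation, not the authors' analysis code. The numbers in the example are placeholders chosen for illustration and are not the measured data from this work.

```python
import math


def pearson_r(x, y):
    """Return Pearson's correlation coefficient for two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical placeholder values, NOT the data reported in this work.
    predicted_scores = [0.95, 0.88, 0.84, 0.80, 0.71, 0.60]
    hplc_purities = [92.0, 85.0, 83.0, 76.0, 70.0, 58.0]
    r = pearson_r(predicted_scores, hplc_purities)
    print(f"r = {r:.3f}, R^2 = {r * r:.3f}")
```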

Synthetic UV-vis, HPLC, and LC-MS data for validation experiments
All PNAs, including one 6-mer, three 10-mers, one 14-mer, and one 18-mer, were synthesized automatically on the 'Tiny Tides' synthesizer under the conditions depicted in the Section S2.2 workflow. PNA cleavage was then performed following the conditions in Section S2.3. After cleavage, the crude purity of each PNA sample was analyzed by HPLC using the methods in Section S1.2 and determined by integration of the HPLC chromatogram. The major peak of each sample was characterized by LC-MS under the conditions in Section S1.3.
The synthetic UV-vis (310 nm), HPLC, and LC-MS data of all six validation PNAs are shown below. The wide, tall peaks indicate the coupling steps, and the narrow, sharp peak after each coupling peak represents fluorenylmethyloxycarbonyl (Fmoc) deprotection. Both the coupling and deprotection peaks are labeled with a "×" symbol. The relative peak areas were integrated with a Python script. Of note, some undesired peaks appear right after the deprotection peaks; these were caused by air bubbles generated during valve switching and were omitted from the machine learning training data set.

Sample: 6-mer in Figure 6. Sequence: GTGAAC-KKK-CONH2. Synthesis method: automated flow synthesis. Resin: 10 mg Rink Amide resin (0.49 mmol/g). Tiny Tides in-line synthetic data at UV absorbance 310 nm.
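The integration script itself is not reproduced here. As a minimal sketch of how relative deprotection peak areas could be obtained from a UV trace, the example below applies the trapezoidal rule inside manually chosen time windows and normalizes each area to the first peak. The function names, the windowing approach, the flat baseline, and the synthetic trace are all illustrative assumptions; choosing windows that exclude the air-bubble artifacts is left to the user.

```python
def trapezoid_area(times, signal, start, end, baseline=0.0):
    """Integrate (signal - baseline) over samples with start <= t <= end."""
    pts = [(t, s - baseline) for t, s in zip(times, signal) if start <= t <= end]
    return sum((t1 - t0) * (s0 + s1) / 2.0
               for (t0, s0), (t1, s1) in zip(pts, pts[1:]))


def relative_areas(times, signal, windows):
    """Return the area in each (start, end) window divided by the first area."""
    areas = [trapezoid_area(times, signal, a, b) for a, b in windows]
    if areas[0] == 0:
        raise ValueError("the first window has zero area")
    return [area / areas[0] for area in areas]


if __name__ == "__main__":
    # Synthetic trace with two triangular peaks (hypothetical, not real data).
    times = list(range(21))
    signal = [0.0] * 21
    signal[4], signal[5], signal[6] = 1.0, 2.0, 1.0      # first deprotection peak
    signal[14], signal[15], signal[16] = 0.5, 1.0, 0.5   # second, smaller peak
    print(relative_areas(times, signal, [(3, 7), (13, 17)]))
```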

Analysis of features using n-grams representation approach
We analyzed the features in the raw dataset weighted by the synthetic yield for the presynthesized sequences. All the features (length, 1-mers, and 2-mers) were used for the analysis. Each unique deprotection step, $i$, had an associated set of independent features, $\mathbf{x}_i$, and an averaged synthetic yield, $y_i$. We multiplied the feature values by the synthetic yield of each unique deprotection step and summed across each feature, for instance, $\sum_i x_{i,\mathrm{length}}\, y_i$ for the length feature. All features were normalized by the sum to obtain the respective contribution of each feature to the synthetic yield.
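The weighting scheme described above can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes that the features are raw counts (length, four 1-mers, sixteen 2-mers) of the presynthesized sequence, that sequences contain only A, C, G, and T, and that "normalized by the sum" means dividing each yield-weighted feature total by the sum over all features. The function names and the toy input are hypothetical.

```python
from itertools import product

MONOMERS = "ACGT"
FEATURES = (["length"] + list(MONOMERS)
            + ["".join(pair) for pair in product(MONOMERS, repeat=2)])


def ngram_features(presynthesized):
    """Count length, 1-mers, and 2-mers of a sequence made of A, C, G, T."""
    feats = dict.fromkeys(FEATURES, 0)
    feats["length"] = len(presynthesized)
    for base in presynthesized:
        feats[base] += 1
    for first, second in zip(presynthesized, presynthesized[1:]):
        feats[first + second] += 1
    return feats


def weighted_contributions(steps):
    """steps: iterable of (presynthesized_sequence, averaged_yield) pairs.

    Returns yield-weighted feature totals, normalized to sum to 1
    (ASSUMED normalization; the text does not specify which sum is used).
    """
    totals = dict.fromkeys(FEATURES, 0.0)
    for sequence, avg_yield in steps:
        for name, value in ngram_features(sequence).items():
            totals[name] += value * avg_yield
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("all weighted feature totals are zero")
    return {name: total / grand_total for name, total in totals.items()}


if __name__ == "__main__":
    # Hypothetical deprotection steps, NOT data from this work.
    toy_steps = [("GT", 0.98), ("GTG", 0.95), ("GTGA", 0.90)]
    contributions = weighted_contributions(toy_steps)
    for name in sorted(contributions, key=contributions.get, reverse=True)[:5]:
        print(f"{name}: {contributions[name]:.3f}")
```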
We observed that length was a key contributor to synthetic yield, followed by the monomeric nucleic acid composition (Figure S4). In line with common intuition, length was anti-correlated with synthetic yield, i.e., as the length of the sequence increases, the synthetic yield decreases. The composition of the monomers (T, C, A, and G, in that order) in the presynthesized sequence was noted to be more important than the dimer composition.

Figure S4. Analysis of features in the raw dataset. A. Individual features important to the synthetic yield, obtained by weighted analysis of the different independent features. The x-axis represents the features selected to train the models, and the y-axis represents the importance (or 'contribution') of each feature to the model performance. B. Synthetic yield of the PNA sequences is anti-correlated with presynthesized sequence length. The scatter plot on the bottom left shows the anti-correlation between the length of the PNAs (x-axis, labelled 'Length') and the synthetic yield (y-axis, labelled 'Area'). The histogram on the bottom right summarizes the synthetic yield for all sampled PNA sequences. The histogram on top shows the length distribution of all analyzed PNA samples.

Analysis of features using Ridge model
Data mining over the training data set reveals the importance of each feature to model performance. As depicted in Figure S5, the relative importance of each feature to the model prediction was summarized with the Ridge model. A higher feature weight indicates a larger contribution to the model prediction. In line with common intuition, the model ranked PNA chain length as one of the most important features (Figure S5). Besides sequence length, the four PNA monomers, i.e., guanine (G), thymine (T), cytosine (C), and adenine (A), contribute significantly to the model performance. Overall, chain length and the four monomers play a more important role than any of the 16 possible dimer permutations with respect to our model performance, an observation consistent with the raw data analysis using the n-grams representation approach (Figure S4).
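To illustrate how feature weights can be read off a ridge model, the sketch below fits closed-form ridge regression on mean-centred data in pure Python and ranks features by absolute weight. It is a stand-in for whatever Ridge implementation was actually used; the regularization strength `alpha`, the function name, and the toy data are assumptions. Weights are only comparable across features when the features are on similar scales, so standardizing the inputs first may be advisable.

```python
def ridge_weights(X, y, alpha=1.0):
    """Closed-form ridge regression on mean-centred data.

    Solves (Xc^T Xc + alpha * I) w = Xc^T yc by Gauss-Jordan elimination.
    The intercept is handled by centring and is not penalized.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    yc = [value - y_mean for value in y]

    # Augmented matrix [Xc^T Xc + alpha * I | Xc^T yc].
    A = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) + (alpha if a == b else 0.0)
          for b in range(p)]
         + [sum(Xc[i][a] * yc[i] for i in range(n))]
         for a in range(p)]

    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular system; try a larger alpha")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[j][p] for j in range(p)]


if __name__ == "__main__":
    # Hypothetical toy data: columns are [length, count of G]; NOT from this work.
    feature_names = ["length", "G"]
    X = [[2, 1], [4, 1], [6, 3], [8, 2]]
    y = [0.95, 0.90, 0.80, 0.75]
    weights = ridge_weights(X, y, alpha=1.0)
    ranking = sorted(zip(feature_names, weights),
                     key=lambda fw: abs(fw[1]), reverse=True)
    for name, weight in ranking:
        print(f"{name}: {weight:+.4f}")
```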